Empirical Evaluation of CRF-based Bibliography Extraction from Research Papers
Abstract
We have proposed an automatic bibliography extraction method for scanned research papers with OCR markup. The method uses conditional random fields (CRFs) to label the sequence of OCRed text lines on an article's title page with the appropriate bibliographic element names. Although we achieved good extraction accuracy for some Japanese academic journals, extraction errors are inevitable. This paper therefore proposes three confidence measures for bibliography labeling to detect such errors. It also reports an empirical evaluation of CRF-based page analysis for research papers in terms of both labeling accuracy and labeling error detection. We applied the three confidence measures to detect labeling errors in articles selected from three academic journals published in Japan. The experiments showed that the proposed confidence measures reasonably indicated labeling accuracy and could be used for error detection. This paper also discusses the tradeoff between the quality of bibliographic data assured by human post-editing of detected errors and the cost of that post-editing.
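The tradeoff between post-editing cost and data quality can be illustrated with a minimal sketch. It does not reproduce the paper's three confidence measures, which the abstract does not define. It assumes a hypothetical page-level measure (the lowest per-line probability of the predicted label) and assumes that a human reviewer corrects every flagged page. The names `LabeledPage`, `page_confidence`, and `review_tradeoff` are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledPage:
    """One labeled title page (hypothetical evaluation record)."""
    line_probs: List[float]  # assumed per-line probability of the predicted label
    all_correct: bool        # ground truth: were all lines labeled correctly?


def page_confidence(page: LabeledPage) -> float:
    # Assumed measure, not taken from the paper: the least confident line
    # determines the confidence of the whole page.
    return min(page.line_probs)


def review_tradeoff(pages: List[LabeledPage], threshold: float) -> Tuple[float, float]:
    """Return (review_cost, final_accuracy) for a given confidence threshold.

    review_cost is the fraction of pages sent to a human reviewer.
    final_accuracy assumes the reviewer fixes every flagged page.
    """
    if not pages:
        raise ValueError("pages must not be empty")
    flagged = [p for p in pages if page_confidence(p) < threshold]
    kept = [p for p in pages if page_confidence(p) >= threshold]
    cost = len(flagged) / len(pages)
    # Flagged pages count as correct after review; kept pages keep their status.
    correct = len(flagged) + sum(1 for p in kept if p.all_correct)
    return cost, correct / len(pages)
```

Raising the threshold sends more pages to review, which increases cost and, under the assumption above, can only raise the final accuracy. An error on a page with high confidence stays undetected, which is why the quality of the confidence measure matters.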
Related resources
Efficiency of Iranian hospitals: a systematic review and meta-analysis of two decades of research
Background and Aim: Increasing the efficiency of healthcare organizations is a necessity because of resource scarcity in the health sector. The aim of this study was to evaluate hospitals' efficiency in Iran. Materials and Methods: This study was conducted using a systematic review and meta-analysis approach to find empirical research papers published on hospital efficiency in Iran between 1997 and 2016...
WI&CRF: a proposed method for extracting required information from military texts
Military information extraction techniques are of interest to military managers and commanders. However, the usual information extraction techniques cannot be used in this domain, because military corpora have a special structure that differs from non-military corpora. In this paper, the structure of military documents is compared with that of non-military documents. Moreover, a new classification is proposed ...
Structured prediction models for RNN based sequence labeling in clinical text
Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In the clinical domain, one major application of sequence labeling is the extraction of medical entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain presents its own set of challen...
A review on EEG based brain computer interface systems feature extraction methods
The brain-computer interface (BCI) provides a communication channel between human and machine. Most of these systems are based on brain activity. Brain-computer interfacing is a methodology that provides a way to communicate with the outside environment using thoughts. The success of this methodology depends on the selection of methods to process the brain signals in each pha...